Audio Processing Tasks




1. Audio Classification

Audio classification is a fundamental problem in the field of audio processing. The task is essentially to extract features from the audio and then identify which class the audio belongs to. Many useful applications of audio classification can be found in the wild – such as genre classification, instrument recognition and artist identification.

It is also the most explored topic in audio processing, with many papers published on it in the last year alone. In fact, we have also hosted a practice hackathon for community collaboration on solving this particular task.

Whitepaper: http://ieeexplore.ieee.org/document/5664796/?reload=true

A common approach to an audio classification task is to pre-process the audio input to extract useful features, and then apply a classification algorithm on them. For example, in the case study below, given a 5-second excerpt of a sound, the task is to identify which class it belongs to – whether it is a dog barking or a drilling sound. The approach mentioned in the article is to extract an audio feature called MFCC (Mel-Frequency Cepstral Coefficients) and then pass it through a neural network to get the appropriate class.
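
As a minimal sketch of this pipeline – assuming librosa and scikit-learn, with the file names and class labels below purely hypothetical – the MFCCs can be averaged over time into a fixed-length vector and fed to a small neural network classifier:

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_mfcc(path):
    """Load a clip and summarise it as a fixed-length MFCC feature vector."""
    y, sr = librosa.load(path, duration=5.0)            # 5-second excerpt
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, frames)
    return mfcc.mean(axis=1)                            # average over time

# hypothetical labelled training clips
paths = ["dog_bark_01.wav", "drilling_01.wav", "dog_bark_02.wav"]
labels = ["dog_bark", "drilling", "dog_bark"]

X = np.vstack([extract_mfcc(p) for p in paths])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, labels)
print(clf.predict([extract_mfcc("unknown_clip.wav")]))
```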

Case Study: https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/

 

2. Audio Fingerprinting

The aim of audio fingerprinting is to determine a compact digital “summary” of a piece of audio, so that it can be identified from a short sample. Shazam is an excellent example of audio fingerprinting applied in industry – it recognises a song on the basis of its first two to five seconds. The task is largely solved as far as industry standards go, but there are still situations where the system fails, especially when there is a high amount of background noise.

Whitepaper: http://www.cs.toronto.edu/~dross/ChandrasekharSharifiRoss_ISMIR2011.pdf

An approach to this problem is to represent the audio in a different form, so that it is more easily deciphered, and then find the patterns that differentiate it from other audio. In the case study below, the author converts raw audio to spectrograms and then uses peak-finding and fingerprint-hashing algorithms to define the fingerprints of that audio file.
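
Here is a rough sketch of the peak-finding and hashing idea (not the case study's exact implementation – the neighbourhood size, threshold, and file name below are illustrative assumptions):

```python
import hashlib
import numpy as np
import librosa
from scipy.ndimage import maximum_filter

y, sr = librosa.load("song.wav")                      # hypothetical audio file
S = np.abs(librosa.stft(y))                           # magnitude spectrogram

# keep only points that are local maxima in their neighbourhood: the "peaks"
local_max = maximum_filter(S, size=20) == S
peaks = np.argwhere(local_max & (S > np.median(S)))   # (freq_bin, frame) pairs
peaks = peaks[np.argsort(peaks[:, 1])]                # order peaks by time

# hash pairs of nearby peaks (anchor + target) into compact fingerprints
fingerprints = []
for (f1, t1), (f2, t2) in zip(peaks[:-1], peaks[1:]):
    if 0 < t2 - t1 <= 200:                            # only pair peaks close in time
        h = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:20]
        fingerprints.append((h, int(t1)))             # hash plus its time offset
```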

Case Study: http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

 

3. Automatic Music Tagging

Music tagging is a more complex version of audio classification. Here, each audio clip may belong to multiple classes at once – in other words, it is a multi-label classification problem. A potential application of this task is to create metadata for the audio so that it can be searched later on. Deep learning has helped solve this task to a certain extent, as can be seen in the case study below.

Whitepaper: https://link.springer.com/article/10.1007/s10462-012-9362-y

As with most of these tasks, the first step is to extract features from the audio sample. Then, tags are assigned according to the nuances of the audio (for example, if a clip contains very little singing voice compared to the instruments used, a tag could be “instrumental”). This can be done with either machine learning or deep learning methods. The case study mentioned below uses deep learning to solve the problem, specifically a convolutional recurrent neural network (CRNN) on top of Mel-frequency feature extraction.
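
A minimal sketch of such a model, assuming Keras – the mel-band, frame, and tag counts are placeholders, and the architecture is far simpler than the case study's. The key ingredients are a log-mel spectrogram input, convolutional layers, a recurrent layer, and sigmoid outputs so each tag gets an independent probability:

```python
from tensorflow.keras import layers, models

N_MELS, N_FRAMES, N_TAGS = 96, 1366, 50               # placeholder sizes

model = models.Sequential([
    layers.Input(shape=(N_MELS, N_FRAMES, 1)),        # log-mel spectrogram
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Permute((2, 1, 3)),                        # -> (time, freq, channels)
    layers.Reshape((341, 24 * 64)),                   # 1366//4=341, 96//4=24
    layers.GRU(64),                                   # summarise the sequence
    layers.Dense(N_TAGS, activation="sigmoid"),       # one probability per tag
])
# binary cross-entropy treats every tag as an independent yes/no decision
model.compile(optimizer="adam", loss="binary_crossentropy")
```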

Case Study: https://github.com/keunwoochoi/music-auto_tagging-keras

 

4. Audio Segmentation

Segmentation literally means dividing a particular object into parts (or segments) based on a defined set of characteristics. Segmentation is an especially important pre-processing step in audio data analysis, because it lets us split a noisy and lengthy audio signal into short, homogeneous segments that are easier to process further. An application of the task is heart sound segmentation, i.e. identifying the sounds specific to the heart.

Whitepaper: http://www.mecs-press.org/ijitcs/ijitcs-v6-n11/IJITCS-V6-N11-1.pdf

One approach is to convert this into a supervised learning problem, where each time stamp is classified according to the segment it belongs to, and then apply an audio classification approach. In the case study below, the task is to segment the heart sound into two parts (lub and dub), so that anomalies can be identified in each segment. It is solved by extracting audio features and then applying deep learning for classification.
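
A minimal frame-wise sketch of that idea, assuming librosa and scikit-learn – the file name is hypothetical, and the random labels below stand in for real per-frame annotations:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

y, sr = librosa.load("heartbeat.wav", sr=None)          # hypothetical recording
frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)

# stand-in for real per-frame annotations: 0 = "lub" (S1), 1 = "dub" (S2)
frame_labels = np.random.randint(0, 2, size=len(frames))

# classify every time frame, turning segmentation into supervised learning
clf = RandomForestClassifier(n_estimators=100).fit(frames, frame_labels)
segments = clf.predict(frames)                          # one label per frame
```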

Case Study: https://www.analyticsvidhya.com/blog/2017/11/heart-sound-segmentation-deep-learning/

 

5. Audio Source Separation

Audio source separation consists of isolating one or more source signals from a mixture of signals. One of the most common applications is separating the vocals from the backing track (for karaoke, for instance). A classic example is shown in Andrew Ng’s machine learning course, where he separates the speaker’s voice from the background music.

Whitepaper: http://ijcert.org/ems/ijcert_papers/V3I1103.pdf

A typical usage scenario involves loading an audio file, computing a time-frequency transform to obtain a spectrogram, and applying a source separation algorithm such as non-negative matrix factorization (NMF) to obtain a time-frequency mask. The mask is then multiplied with the spectrogram, and the result is converted back to the time domain.
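
A minimal sketch of that pipeline using librosa's built-in NMF decomposition – the file names, the number of components, and in particular the assumption that the first four components belong to the target source are all illustrative:

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("mixture.wav", sr=None)     # hypothetical mixed recording
S = librosa.stft(y)                              # complex time-frequency transform
magnitude = np.abs(S)

# factorise the magnitude spectrogram into spectral templates and activations
comps, acts = librosa.decompose.decompose(magnitude, n_components=8, sort=True)

# assume (purely for illustration) the first 4 components are the target source
target = comps[:, :4] @ acts[:4, :]
residual = np.maximum(magnitude - target, 0.0)
mask = librosa.util.softmask(target, residual, power=2)  # time-frequency mask

# multiply the mask with the spectrogram and invert back to the time domain
y_sep = librosa.istft(mask * S)
sf.write("separated.wav", y_sep, sr)
```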

Case Study: https://github.com/IoSR-Surrey/untwist

 

6. Beat Tracking

As the name suggests, the goal here is to track the location of each beat in a collection of audio files. Beat tracking can be used to automate the time-consuming work of synchronizing events with music, and is useful in various applications such as video editing, audio editing, and human-computer improvisation.

Whitepaper: https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-students/2012_GroschePeter_MusicSignalProcessing_PhD-Thesis.pdf

An approach to beat tracking would be to parse the audio file and use an onset detection algorithm to track the beats. Although the techniques used for onset detection rely heavily on audio feature engineering and machine learning, deep learning can also be used here to improve the results.
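
As a sketch, librosa's beat tracker follows exactly this recipe, deriving beats from an onset strength envelope (the file name is hypothetical):

```python
import librosa

y, sr = librosa.load("song.wav")                      # hypothetical audio file
onset_env = librosa.onset.onset_strength(y=y, sr=sr)  # onset/novelty curve
tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beats, sr=sr)     # beat locations in seconds
print("estimated tempo (BPM):", tempo)
print("first few beats (s):", beat_times[:4])
```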

Case Study: https://github.com/adamstark/BTrack

 

7. Music Recommendation

Thanks to the internet, we now have millions of songs we can listen to anytime. Ironically however, this has made it even harder to discover new music because of the plethora of options out there. Music recommendation systems help deal with this information overload by automatically recommending new music to listeners. Content providers like Spotify and Saavn have developed highly sophisticated music recommendation engines. These models leverage the user’s past listening history among many other features to build customized recommendation lists.

Whitepaper: https://pdfs.semanticscholar.org/7442/c1ebd6c9ceafa8979f683c5b1584d659b728.pdf

We can tackle the problem of predicting listening preferences from audio signals by training a regression or deep learning model to predict the latent representations of songs obtained from a collaborative filtering model. This way, we can place a song in the collaborative filtering space even if no usage data is available for it.
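
A minimal sketch of that mapping, assuming scikit-learn – the case study uses a convolutional network on spectrograms, but a simple ridge regression on stand-in data shows the structure of the idea (all arrays below are random placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge

# placeholder data: X holds one audio feature vector per song, V holds the
# latent factors learned for the same songs by a collaborative filtering model
n_songs, n_features, n_factors = 1000, 128, 40
X = np.random.randn(n_songs, n_features)    # stand-in for real audio features
V = np.random.randn(n_songs, n_factors)     # stand-in for real CF factors

# learn a mapping from audio content into the collaborative filtering space
reg = Ridge(alpha=1.0).fit(X, V)

# a brand-new song with no listening history can now be placed in that space
new_song = np.random.randn(1, n_features)
predicted_factors = reg.predict(new_song)   # usable for nearest-neighbour recs
```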

Case Study: http://benanne.github.io/2014/08/05/spotify-cnns.html

 

8. Music Retrieval

One of the most difficult tasks in audio processing, music retrieval essentially aims to build a search engine based on audio. Although we can approach this by solving sub-tasks like audio fingerprinting, the task encompasses much more than that. For example, different types of music retrieval require solving different smaller tasks (timbre detection, for instance, would be great for gender identification). Currently, no system has been developed that matches industry-expected standards.

Whitepaper: http://www.nowpublishers.com/article/Details/INR-042

To solve music retrieval, the task is divided into simpler sub-tasks, which include tonal analysis (e.g. melody and harmony) and rhythm or tempo analysis (e.g. beat tracking). Then, on the basis of these individual analyses, information is extracted and used to retrieve similar audio samples.
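
A toy sketch of this divide-and-combine idea – summarising each track by tonal (chroma) and rhythmic (tempo) statistics, then ranking a hypothetical library by similarity to a query clip (the file names and the tempo scaling are placeholders):

```python
import numpy as np
import librosa
from sklearn.metrics.pairwise import cosine_similarity

def descriptor(path):
    """Summarise a track by tonal (chroma) and rhythmic (tempo) statistics."""
    y, sr = librosa.load(path)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # tonal content
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)    # rhythmic content
    bpm = np.atleast_1d(tempo)[0] / 200.0             # crude normalisation
    return np.concatenate([chroma.mean(axis=1), [bpm]])

library = ["track1.wav", "track2.wav", "track3.wav"]  # hypothetical collection
feats = np.vstack([descriptor(p) for p in library])
query = descriptor("query.wav").reshape(1, -1)

ranking = cosine_similarity(query, feats)[0].argsort()[::-1]
print([library[i] for i in ranking])                  # most similar first
```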

Case Study: https://youtu.be/oGGVvTgHMHw

 

9. Music Transcription

Music transcription is another challenging audio processing task. It involves annotating audio to create a kind of “sheet” from which the music can be generated at a later point in time. The manual effort involved in transcribing music from recordings can be vast – it varies enormously depending on the complexity of the music, how good our listening skills are, and how detailed we want our transcription to be.

Whitepaper: http://ieeexplore.ieee.org/abstract/document/7955698

The approach for music transcription is similar to that of speech recognition, except that instead of words, musical notes and the instruments playing them are transcribed.
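
For a monophonic melody, a rough frame-level sketch with librosa's pYIN pitch tracker shows the first step – quantising estimated frequencies to note names (the file name is hypothetical, and real transcription systems go much further):

```python
import librosa

y, sr = librosa.load("melody.wav")   # hypothetical monophonic recording

# pYIN fundamental-frequency tracking over a plausible melodic range
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)

# quantise each voiced frame's frequency to the nearest note name
notes = [librosa.hz_to_note(f) for f in f0[voiced]]
print(notes[:20])                    # a rough frame-level "transcription"
```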

Case Study: https://youtu.be/9boJ-Ai6QFM

 

10. Onset Detection

Onset detection is the first step in analysing an audio/music sequence. For most of the tasks mentioned above, detecting the start of each audio event – onset detection – is a necessary first step. It was also essentially one of the first tasks that researchers set out to solve in audio processing.

Whitepaper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.989&rep=rep1&type=pdf

Onset detection is typically done in three steps (see the sketch after this list):

  1. Computing a spectral novelty function.
  2. Finding peaks in the spectral novelty function.
  3. Backtracking from each peak to a preceding local minimum. Backtracking can be useful for finding segmentation points such that the onset occurs shortly after the beginning of the segment.
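
All three steps map directly onto librosa's onset utilities, as a minimal sketch (the file name is hypothetical):

```python
import librosa

y, sr = librosa.load("clip.wav")   # hypothetical audio file

# step 1: compute the spectral novelty (onset strength) function
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# steps 2 and 3: pick peaks, backtracking each one to the preceding minimum
onsets = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr,
                                    backtrack=True, units="time")
print(onsets[:10])                 # onset times in seconds
```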

Case Study: https://musicinformationretrieval.com/onset_detection.html

 

End Notes

In this article, I have covered ten tasks that you can explore when solving audio processing problems. I hope you find the article insightful for dealing with audio/speech related projects.